While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data to answer this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
For this project, you'll be cleaning and exploring the data set in a Jupyter notebook. To see a sample notebook containing all of the answers, visit the project's GitHub repository.
In [42]:
import pandas as pd
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
We need to specify an encoding because the file is not stored as UTF-8, which pandas assumes by default, so some of its bytes cannot be decoded as UTF-8. You can read more about character encodings on developer Joel Spolsky's blog.
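To make the encoding issue concrete, here is a minimal sketch using a made-up string (it is not taken from the survey file). It shows that bytes written as ISO-8859-1 can fail to decode as UTF-8 but decode cleanly once the right encoding is named.

```python
# Hypothetical sample text; not taken from star_wars.csv.
text = "café"
raw = text.encode("ISO-8859-1")  # the accented letter becomes the single byte 0xE9

# Decoding those bytes as UTF-8 fails, because 0xE9 alone is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("ISO-8859-1")
print(utf8_ok, decoded)
```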
In [43]:
star_wars.head(10)
Out[43]:
In [44]:
star_wars.columns
Out[44]:
In [45]:
star_wars = star_wars[pd.notnull(star_wars['RespondentID'])]
star_wars.head()
Out[45]:
Some columns are currently string types because their main values are Yes and No. We can make the data easier to analyze later by converting each of these columns to a Boolean type with only the values True, False, and NaN.
In [46]:
bool_type = {
    'Yes': True,
    'No': False
}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(bool_type)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(bool_type)
In [47]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].head()
Out[47]:
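As a small illustration of how mapping with a dictionary behaves, the sketch below uses a made-up Series (the name answers and its values are assumptions for this example only). Values that have no key in the dictionary, including missing responses, come out as NaN.

```python
import pandas as pd

# Hypothetical answers, including one missing response.
answers = pd.Series(["Yes", "No", None, "Yes"])
yes_no = {"Yes": True, "No": False}

# Each value is looked up in the dictionary; anything without a key becomes NaN.
converted = answers.map(yes_no)
print(converted.tolist())
```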
Next, convert the "seen" columns to Boolean values and rename them.
In [48]:
import numpy as np
bool_type1 = {
    "Star Wars: Episode I The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II Attack of the Clones": True,
    "Star Wars: Episode III Revenge of the Sith": True,
    "Star Wars: Episode IV A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(bool_type1)
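# The first "seen" column keeps the survey question as its header;
# rename() silently skips any key that does not match a column.
star_wars = star_wars.rename(columns={
    'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
    'Unnamed: 4': 'seen_2',
    'Unnamed: 5': 'seen_3',
    'Unnamed: 6': 'seen_4',
    'Unnamed: 7': 'seen_5',
    'Unnamed: 8': 'seen_6'
})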
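Because rename ignores keys that do not match any column, a mistyped header leaves the column unchanged without raising an error. One possible alternative, sketched here on a made-up one-row DataFrame (the header names are placeholders, not the survey's), is to build the mapping from column positions instead of typing the headers.

```python
import pandas as pd

# Hypothetical stand-in for six "seen" columns; the header names are made up.
headers = ["first header"] + [f"Unnamed: {i}" for i in range(1, 6)]
df = pd.DataFrame([[True, False, True, True, True, False]], columns=headers)

# Build {old_name: new_name} from positions, so no header text has to be typed.
new_names = {old: f"seen_{i}" for i, old in enumerate(df.columns[0:6], start=1)}
df = df.rename(columns=new_names)
print(list(df.columns))
```

In the notebook the same idea would use the slice star_wars.columns[3:9], assuming the six columns sit at those positions.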
In [49]:
star_wars.head()
Out[49]:
In [50]:
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
star_wars.columns[9:15]
Out[50]:
In [51]:
star_wars = star_wars.rename(columns={
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': "ranking_1",
    'Unnamed: 10': 'ranking_2',
    'Unnamed: 11': 'ranking_3',
    'Unnamed: 12': 'ranking_4',
    'Unnamed: 13': 'ranking_5',
    'Unnamed: 14': 'ranking_6'
})
In [52]:
star_wars.columns[9:15]
Out[52]:
Now that we've cleaned up the ranking columns, we can find the highest-ranked movie more quickly by taking the mean of each column.
In [53]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.bar(range(6), star_wars[star_wars.columns[9:15]].mean())
Out[53]:
The fifth movie (Episode V The Empire Strikes Back) has the best average ranking (in this survey 1 = best, 6 = worst, so a lower bar is better). "Episode III Revenge of the Sith" has the worst average ranking.
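Reading the best and worst film off a bar chart can also be done in code. The sketch below uses three made-up respondents (the values are hypothetical, not survey results) and picks the columns with the lowest and highest mean ranking.

```python
import pandas as pd

# Hypothetical rankings from three respondents (1 = favorite, 6 = least favorite).
rankings = pd.DataFrame({
    "ranking_1": [4.0, 5.0, 3.0],
    "ranking_2": [5.0, 4.0, 6.0],
    "ranking_3": [6.0, 6.0, 5.0],
    "ranking_4": [2.0, 2.0, 2.0],
    "ranking_5": [1.0, 1.0, 1.0],
    "ranking_6": [3.0, 3.0, 4.0],
})

mean_rank = rankings.mean()   # one average per film
best = mean_rank.idxmin()     # lowest mean = most preferred
worst = mean_rank.idxmax()    # highest mean = least preferred
print(best, worst)
```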
In [54]:
plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum())
Out[54]:
We can figure out how many people have seen each movie just by taking the sum of each column. The earliest movies are more popular, which matches the rankings above (the earlier movies also rank better).
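Summing works because True counts as 1 and False as 0. A minimal sketch on made-up data (the values are hypothetical) shows both the count and, using the mean, the share of respondents who have seen each film.

```python
import pandas as pd

# Hypothetical "seen" flags for three respondents.
seen = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_5": [True, True, True],
})

counts = seen.sum()   # number of True values per column
share = seen.mean()   # fraction of True values per column
print(counts.to_dict(), share.to_dict())
```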
In [55]:
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
In [56]:
## Redo the two previous analyses (find the most viewed movie and the highest-ranked movie) separately for each group
In [59]:
## find highest-ranked movie (lower is better)
plt.bar(range(6), females[females.columns[9:15]].mean())
plt.show()
plt.bar(range(6), males[males.columns[9:15]].mean())
plt.show()
In [60]:
## find most viewed movie (higher is better)
plt.bar(range(6), females[females.columns[3:9]].mean())
plt.show()
plt.bar(range(6), males[males.columns[3:9]].mean())
plt.show()
More males than females have watched every episode, but males rank only the earliest movies highly. Fewer females have watched the newer episodes, but they rank those newer episodes better.
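As an alternative to splitting the data into two DataFrames, the same comparison can be sketched with groupby, which computes the per-group means in one step. The data below is made up for illustration and only includes one "seen" and one "ranking" column.

```python
import pandas as pd

# Hypothetical respondents; values are not from the survey.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "seen_1": [True, False, True, True],
    "ranking_1": [3.0, 2.0, 5.0, 4.0],
})

# One row per gender: share who have seen the film and its mean ranking.
by_gender = df.groupby("Gender")[["seen_1", "ranking_1"]].mean()
print(by_gender)
```

Calling by_gender.plot.bar() on a table like this would draw the groups side by side, which can make differences easier to compare than two separate charts.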